Abstract—Many networks display community structure, which
identifies groups of nodes within which connections are denser
than between them. Detecting and characterizing such community
structure, which is known as community detection, is one of
the fundamental issues in the study of network systems. It has
received considerable attention in recent years. Numerous
techniques have been developed for both efficient and effective
community detection. Among them, the most efficient is the
label propagation algorithm, whose computational complexity
is O(|E|). Although it is linear in the number of edges, the
running time is still too long for very large networks, creating the
need for parallel community detection. Also, computing community
quality metrics for community structure is computationally
expensive both with and without ground truth. However, to date
we are not aware of any effort to introduce parallelism for
this problem. In this paper, we provide a parallel toolkit* to
calculate the values of such metrics. We evaluate the parallel
algorithms on both a distributed memory machine and a shared
memory machine. The experimental results show that they yield
a significant performance gain over sequential execution in terms
of total running time, speedup, and efficiency.

I. Introduction 

Many networks, including the Internet, citation networks,
transportation networks, e-mail networks, and social and
biochemical networks, display community structure, which
identifies groups of nodes within which connections are denser
than between them. Detecting and characterizing such
community structure, which is known as community detection,
is one of the fundamental issues in the study of network
systems. Community detection has been shown to reveal latent
yet meaningful structures in networks.

Thus, numerous techniques were developed for both efficient
and effective community detection, including Modularity
Optimization, Clique Percolation, Local Expansion,
Fuzzy Clustering, Link Partitioning, and
Label Propagation. Among them, the most
efficient is the label propagation algorithm, whose
computational complexity is O(|E|), where |E| is the number
of edges in the network. Although it is a linear algorithm,
the running time is still too long for very large networks.
The primary examples are online social networks that are
increasingly popular, the largest being Facebook with more
than 800 million daily active users. The WWW forms a

* Please contact Mingming Chen via mileschen2008@gmail.com for the
parallel toolkit if you are interested in it.

^Facebook company info: http://newsroom.fb.com/company-info/ 


network of hyperlinked webpages in excess of 30 billion
nodes. Therefore, parallelism was introduced into community
detection to alleviate computational costs.
However, to date we are not aware of any effort
that provides parallel computation for the community quality
metrics with and without ground truth community structure,
though it is computationally expensive to do so.
Hence, in this paper, we provide a parallel toolkit to calculate
the values of these metrics. Although we use parallel
computing to speed up the processing, in most cases the
algorithms are highly parallelizable, so the contribution of
this paper focuses on making highly efficient social network
analysis tools available to the research community.

We implement the parallel algorithms with MPI (Message
Passing Interface) and Pthreads (POSIX Threads). We
perform runs on both a distributed memory machine, such as
Blue Gene/Q, and a shared memory machine, like GANXIS.
The network we adopt is the LFR benchmark network [22]. The
LFR benchmark network for testing the parallel programs
that calculate the metrics with ground truth community
structure has 100,000 nodes. We choose two sizes of the LFR
benchmark network, ten million nodes (10,000,000) and one
hundred million nodes (100,000,000), to test the parallel
programs for computing the metrics without ground truth
communities. The experimental results show that both the parallel
MPI algorithms and the parallel Pthreads algorithms yield a
significant performance gain over sequential execution. Moreover,
in terms of speedup and efficiency, we recommend using the
parallel MPI algorithms to calculate the metrics with ground
truth communities and the parallel Pthreads algorithm to
calculate the metrics without ground truth communities on
GANXIS (or shared memory machines).

II. Community Quality Metrics 
A. Metrics with Ground Truth Communities 

The quality evaluation metrics with ground truth community
structure that we consider here can be divided into three
categories: Variation of Information (VI) and Normalized
Mutual Information (NMI), based on information theory;
F-measure and Normalized Van Dongen metric (NVD), based
on cluster matching; and Rand Index (RI), Adjusted Rand Index
(ARI), and Jaccard Index (JI), based on pair counting.

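1) Information Theory Based Metrics: Given partitions
C and C', VI quantifies the “distance” between those two
partitions, while NMI measures the similarity between them.
They are defined as

$$ VI(C, C') = -\sum_{c \in C,\, c' \in C'} \frac{|c \cap c'|}{|V|} \log \frac{|c \cap c'|^2}{|c||c'|}, \tag{1} $$

$$ NMI(C, C') = \frac{-2 \sum_{c \in C,\, c' \in C'} \frac{|c \cap c'|}{|V|} \log \frac{|c \cap c'||V|}{|c||c'|}}{\sum_{c \in C} \frac{|c|}{|V|} \log \frac{|c|}{|V|} + \sum_{c' \in C'} \frac{|c'|}{|V|} \log \frac{|c'|}{|V|}}, \tag{2} $$

where |V| is the number of nodes in the network, |c| is the
number of nodes in community c of C, and |c ∩ c'| is the
number of nodes both in community c of C and in community
c' of C'. The computational complexity to calculate VI and
NMI is O(|V||C'|), where |C'| is the number of communities
found by a community detection algorithm.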
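As an illustration only, the following sequential Python sketch computes VI and NMI from two node-to-community maps. It assumes the standard overlap-based definitions (VI as the sum over overlapping pairs of -|c ∩ c'|/|V| log(|c ∩ c'|^2/(|c||c'|)), and NMI as twice the mutual information divided by the sum of the two entropies); the function name `vi_nmi` and the dictionary inputs are hypothetical and not part of the toolkit.

```python
import math
from collections import Counter


def vi_nmi(truth, found):
    """Return (VI, NMI) for two disjoint partitions.

    truth, found: dicts mapping each node to its community id in C and C'.
    Both dicts are assumed to cover the same node set.
    """
    n = len(truth)
    size_c = Counter(truth.values())   # |c| for every c in C
    size_d = Counter(found.values())   # |c'| for every c' in C'
    # |c ∩ c'| for every pair of communities that share at least one node.
    overlap = Counter((truth[v], found[v]) for v in truth)

    vi = 0.0
    mutual = 0.0
    for (c, d), k in overlap.items():
        vi -= k / n * math.log(k * k / (size_c[c] * size_d[d]))
        mutual += k / n * math.log(k * n / (size_c[c] * size_d[d]))

    h_c = -sum(s / n * math.log(s / n) for s in size_c.values())
    h_d = -sum(s / n * math.log(s / n) for s in size_d.values())
    # Two single-community partitions have zero entropy; treat them as identical.
    nmi = 2 * mutual / (h_c + h_d) if h_c + h_d > 0 else 1.0
    return vi, nmi
```

Only pairs with a non-empty overlap contribute, so one pass over the nodes is enough to build the overlap table in this sequential form.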


2) Cluster Matching Based Metrics: Measures based on
cluster matching aim at finding the largest overlaps between
pairs of communities of two partitions C and C'. F-measure
measures the similarity between two partitions, while NVD
quantifies the “distance” between partitions C and C'.
F-measure is defined as

$$ \text{F-measure}(C, C') = \frac{1}{|V|} \sum_{c \in C} |c| \max_{c' \in C'} \frac{2|c \cap c'|}{|c| + |c'|}. \tag{3} $$

NVD is given by

$$ NVD(C, C') = 1 - \frac{1}{2|V|} \left( \sum_{c \in C} \max_{c' \in C'} |c \cap c'| + \sum_{c' \in C'} \max_{c \in C} |c' \cap c| \right). \tag{4} $$

The complexity to calculate F-measure and NVD is
O(|V|(|C| + |C'|)), where |C| is the number of communities
in the ground truth community structure.
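A minimal sequential sketch of the two cluster matching metrics follows. It assumes the usual best-match forms of F-measure (size-weighted best harmonic overlap per ground truth community) and NVD (one minus the normalized sum of best overlaps in both directions); `fmeasure_nvd` and its inputs are illustrative names, not the toolkit's interface.

```python
from collections import defaultdict


def fmeasure_nvd(truth, found):
    """Return (F-measure, NVD) for node -> community maps truth (C) and found (C')."""
    n = len(truth)
    t_coms = defaultdict(set)
    d_coms = defaultdict(set)
    for v, c in truth.items():
        t_coms[c].add(v)
    for v, d in found.items():
        d_coms[d].add(v)

    f_sum = 0.0
    t_best = 0  # sum over c in C of max overlap with any c' in C'
    for c in t_coms.values():
        f_sum += len(c) * max(2 * len(c & d) / (len(c) + len(d))
                              for d in d_coms.values())
        t_best += max(len(c & d) for d in d_coms.values())

    # Sum over c' in C' of max overlap with any c in C (needed only for NVD).
    d_best = sum(max(len(d & c) for c in t_coms.values())
                 for d in d_coms.values())

    return f_sum / n, 1 - (t_best + d_best) / (2 * n)
```

The three running maxima in this sketch play the same role as maxNormedComs[], maxTComs[], and maxDComs[] in the parallel version.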


3) Pair Counting Based Metrics: Metrics based on pair
counting count the number of pairs of nodes that are classified
(in the same community or in different communities) in two
partitions C and C'. Let a11 indicate the number of pairs
of nodes that are in the same community in both partitions,
a10 denote the number of pairs of nodes that are in the same
community in C but in different communities in C', a01 be the
number of pairs of nodes that are in different communities in
C but in the same community in C', and a00 be the number of pairs
of nodes that are in different communities in both partitions.
By definition, A = a11 + a10 + a01 + a00 = |V|(|V| - 1)/2 is the
total number of pairs of nodes in the network. Then, RI, which
is the ratio of the number of node pairs placed in the same
way in both partitions to the total number of pairs, is given by

$$ RI(C, C') = \frac{a_{11} + a_{00}}{A}. \tag{5} $$

Denote M = (1/A)(a11 + a10)(a11 + a01). Then, RI's corresponding
adjusted version, ARI, is expressed as

$$ ARI(C, C') = \frac{a_{11} - M}{\frac{1}{2}\left[(a_{11} + a_{10}) + (a_{11} + a_{01})\right] - M}. \tag{6} $$

JI, which is the ratio of the number of node pairs placed in
the same community in both partitions to the number of node
pairs that are placed in the same group in at least one partition,
is defined as

$$ JI(C, C') = \frac{a_{11}}{a_{11} + a_{10} + a_{01}}. \tag{7} $$

Each of these three metrics quantifies the similarity between
two partitions C and C'. The complexity to calculate RI,
ARI, and JI is O(|V|^2).
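The pair counting definitions translate directly into a quadratic loop over node pairs. The sketch below is a sequential illustration under the assumption that RI, ARI, and JI take their standard pair-count forms; the name `pair_counting` and the fallback value 1.0 for degenerate denominators are choices made here for the example.

```python
from itertools import combinations


def pair_counting(truth, found):
    """Return (RI, ARI, JI) from node -> community maps truth (C) and found (C')."""
    a11 = a10 = a01 = a00 = 0
    for u, v in combinations(sorted(truth), 2):
        same_t = truth[u] == truth[v]   # same community in C
        same_d = found[u] == found[v]   # same community in C'
        if same_t and same_d:
            a11 += 1
        elif same_t:
            a10 += 1
        elif same_d:
            a01 += 1
        else:
            a00 += 1

    total = a11 + a10 + a01 + a00       # equals |V|(|V| - 1)/2
    ri = (a11 + a00) / total
    m = (a11 + a10) * (a11 + a01) / total
    denom = 0.5 * ((a11 + a10) + (a11 + a01)) - m
    ari = (a11 - m) / denom if denom else 1.0
    same_somewhere = a11 + a10 + a01
    ji = a11 / same_somewhere if same_somewhere else 1.0
    return ri, ari, ji
```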


B. Metrics without Ground Truth Communities 

1) Newman’s Modularity: Modularity measures the
difference between the actual fraction of edges within the
community and such fraction expected in a randomized graph with
the same number of nodes and the same degree sequence. For
the given community partition of an unweighted and undirected
graph G = (V, E) with |E| edges, modularity (Q) is given by

$$ Q = \sum_{c \in C} \left[ \frac{|E_c^{in}|}{|E|} - \left( \frac{2|E_c^{in}| + |E_c^{out}|}{2|E|} \right)^2 \right], \tag{8} $$

where C is the set of all the communities, c is a specific
community in C, |E_c^in| is the number of edges between nodes
within community c, and |E_c^out| is the number of edges from
the nodes in community c to the nodes outside c.
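For concreteness, a short sequential sketch of modularity is given below. It assumes the per-community form Q = sum over c of [|E_c^in|/|E| - ((2|E_c^in| + |E_c^out|)/(2|E|))^2]; the edge-list input and the name `modularity` are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman's modularity for an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> community id.
    """
    m = len(edges)
    e_in = defaultdict(int)    # edges with both endpoints inside c
    e_out = defaultdict(int)   # edges with exactly one endpoint inside c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        if cu == cv:
            e_in[cu] += 1
        else:
            e_out[cu] += 1
            e_out[cv] += 1

    q = 0.0
    for c in set(membership.values()):
        q += e_in[c] / m - ((2 * e_in[c] + e_out[c]) / (2 * m)) ** 2
    return q
```

Because each community's term depends only on its own edge counts, the sum can be split across processors without any communication, which is what the parallel version exploits.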


2) Modularity Density: Modularity Density (Qds) [20]
is proposed to solve the two opposite yet coexisting
problems of modularity: in some cases, it tends to favor
small communities over large ones while in others, large
communities over small ones. The latter tendency is also
known as the resolution limit problem. For unweighted
and undirected networks, Qds is defined as

$$ Q_{ds} = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} d_{c_i} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} d_{c_i} \right)^2 - \sum_{c_j \in C,\, c_j \neq c_i} \frac{|E_{c_i,c_j}|}{2|E|} d_{c_i,c_j} \right], \tag{9} $$

$$ d_{c_i} = \frac{2|E_{c_i}^{in}|}{|c_i|(|c_i| - 1)}, \qquad d_{c_i,c_j} = \frac{|E_{c_i,c_j}|}{|c_i||c_j|}. $$

In the above, d_ci is the internal density of community ci,
|E_ci,cj| is the number of edges from ci to cj, and d_ci,cj is
the pair-wise density between communities ci and cj.


3) Six Other Community Quality Measures: We also
consider six other metrics without ground truth community
structure, including the number of Intra-edges, Intra-density,
Contraction, the number of Inter-edges, Expansion, and
Conductance [20], which characterize how community-like
the connectivity structure of a given set of nodes is. All
of them rely on the intuition that communities are sets of
nodes with many edges inside and few edges outside.

The number of Intra-edges: |E_c^in|; it is the total number of
edges in community c.

Intra-density: d_c in Equation (9).

Contraction: 2|E_c^in|/|c|; it measures the average number of
edges per node inside the community c.

The number of Inter-edges: |E_c^out|; it is the total number of
edges on the boundary of c.

Expansion: |E_c^out|/|c|; it measures the average number of
edges (per node) that point outside the community c.

Conductance: |E_c^out|/(2|E_c^in| + |E_c^out|); it measures the fraction of the
total number of edges that point outside the community.
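The six measures can be read off from two edge counts and the community size. The sketch below is illustrative only: it assumes the internal density is d_c = 2|E_c^in|/(|c|(|c| - 1)), and the function name, dictionary keys, and zero fallbacks for degenerate communities are hypothetical choices.

```python
def community_measures(edges, membership, c):
    """Six per-community measures for community c of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping node -> community id.
    """
    size = sum(1 for com in membership.values() if com == c)
    e_in = sum(1 for u, v in edges
               if membership[u] == c and membership[v] == c)
    # Boundary edges have exactly one endpoint inside c.
    e_out = sum(1 for u, v in edges
                if (membership[u] == c) != (membership[v] == c))
    return {
        "intra_edges": e_in,
        "intra_density": 2 * e_in / (size * (size - 1)) if size > 1 else 0.0,
        "contraction": 2 * e_in / size,
        "inter_edges": e_out,
        "expansion": e_out / size,
        "conductance": e_out / (2 * e_in + e_out) if e_in + e_out else 0.0,
    }
```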
























III. Parallel Algorithm Design 
In this section, we present the parallel algorithms, in MPI and
Pthreads versions, to calculate the quality metrics introduced
in Section II. It can be seen from Section II that VI and
NMI can be calculated together, F-measure and NVD can
be computed together, RI, ARI, and JI can be calculated
together, and the metrics without ground truth communities
can be computed together. Thus, we have four parallel
algorithms based on MPI and four parallel algorithms based
on Pthreads for these metrics. We denote the ground truth
community structure as C and the community structure detected
with a community detection algorithm as C'.

A. Parallel Algorithms Based on MPI 

In the parallel algorithms based on MPI, the problem
of calculating the metrics is partitioned by community.
For the algorithms that calculate the metrics with
ground truth community structure, each processor extracts the
ground truth communities and the detected communities for which
(comid mod numProcs == procid) to achieve rough
load balance, where comid is the id of a community, procid is
the id of a processor, and numProcs is the total number
of processors used. Hence, each processor will have
|C|/numProcs or |C|/numProcs + 1 ground truth communities
and |C'|/numProcs or |C'|/numProcs + 1 discovered
communities. For the algorithms that compute the metrics
without ground truth communities, each processor extracts the
discovered communities using the same approach. Also, each
processor gets its local network, which contains the nodes in its
own communities and the neighboring nodes of these nodes.
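The modulo assignment and the local network extraction described above can be sketched as follows. This is a single-process illustration, not the toolkit's code; `assign_communities` and `local_network` are hypothetical helper names, and the local network is represented here simply as the list of edges touching an owned node.

```python
def assign_communities(com_ids, num_procs):
    """Give community comid to processor (comid mod num_procs)."""
    owned = [[] for _ in range(num_procs)]
    for comid in com_ids:
        owned[comid % num_procs].append(comid)
    return owned


def local_network(edges, communities, owned_ids):
    """Edges incident to at least one node of the owned communities.

    communities: dict mapping community id -> set of member nodes.
    The result covers the owned nodes and their neighboring nodes.
    """
    own_nodes = set()
    for comid in owned_ids:
        own_nodes |= communities[comid]
    return [(u, v) for u, v in edges if u in own_nodes or v in own_nodes]
```

With consecutive community ids, the number of communities per processor differs by at most one, which gives the rough load balance mentioned in the text when community sizes are comparable.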


Algorithm 1 MPI_information_theory_metric(C_p, C'_p)

1: // Processor id is denoted as p. The number of processors
   used is denoted as numProcs. The local values of VI
   and NMI are denoted as rankVI and rankNMI.
2: Calculate rankVI and rankNMI based on Equation (1)
   and Equation (2) using its own C_p and its own C'_p;
3: // Circulate C'_p in a ring.
4: recvSrc = (p + numProcs - 1) mod numProcs;
5: sendDst = (p + 1) mod numProcs;
6: receivedMsgNum = 0;
7: while receivedMsgNum < (numProcs - 1) do
8:    sendBuf[] ← C'_p;
9:    Send sendBuf[] to sendDst;
10:   Receive C'_p from recvSrc with recvBuf[];
11:   Calculate rankVI and rankNMI based on Equations (1)
      and (2) using its own C_p and received C'_p;
12:   ++receivedMsgNum;
13: end while
14: Get VI and NMI by summing the rankVI and
    rankNMI of all numProcs processors;
15: Return VI and NMI;


We first show the parallel MPI algorithm to calculate the
information theory based metrics, VI and NMI. Suppose
there are N processors (or MPI ranks). Each processor reads
its own set of ground truth communities C_p and its own
set of discovered communities C'_p. It can be seen from
the definitions of VI and NMI given by Equation (1) and
Equation (2), respectively, that each ground truth community
must be compared against all the discovered communities in
order to obtain their values. Thus, in the algorithm, we circulate
C'_p to each processor in a “ring”. That is, processor 0 sends its
C'_p to processor 1, processor 1 to processor 2, and processor
N - 1 sends its C'_p to processor 0. This “shifting” of C'_p
occurs N - 1 times for N processors. For its own C'_p and
each received C'_p, the processor calculates its local values
of VI and NMI with its own C_p. Finally, the values of VI
and NMI are the sums of the local values of all N processors.
Algorithm 1 shows our parallel algorithm for computing VI
and NMI based on MPI. It takes C_p and C'_p as parameters.
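The ring circulation can be simulated in a single process to see why N - 1 shifts are enough for every local block to meet every circulated block exactly once. The sketch below is only a simulation under that reading of Algorithm 1: it uses plain Python lists in place of MPI messages, and `ring_accumulate` and `combine` are hypothetical names.

```python
def ring_accumulate(local_blocks, circulating_blocks, combine):
    """Simulate N processors that pass a block around a ring.

    local_blocks[p] stays on processor p; circulating_blocks[p] starts on
    processor p and is forwarded to processor (p + 1) mod N at each shift.
    combine(local, visiting) returns the local contribution of one pairing.
    Returns the sum of the local totals of all processors.
    """
    n = len(local_blocks)
    in_hand = list(circulating_blocks)
    # Each processor first works on its own pair of blocks.
    totals = [combine(local_blocks[p], in_hand[p]) for p in range(n)]
    for _ in range(n - 1):
        # Processor p receives from (p + n - 1) mod n, as recvSrc does.
        in_hand = [in_hand[(p + n - 1) % n] for p in range(n)]
        for p in range(n):
            totals[p] += combine(local_blocks[p], in_hand[p])
    return sum(totals)
```

Any quantity that is a sum over (ground truth community, discovered community) pairs, such as the sums in VI and NMI, can be accumulated this way and then reduced across processors.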


Algorithm 2 MPI_cluster_matching_metric(C_p, C'_p)

1: // Use maxNormedComs[] and maxTComs[] to record
   the max item for each ground truth community shown in
   Equations (3) and (4), respectively.
2: Get maxNormedComs[] and maxTComs[] for each
   community in C_p with its own C'_p;
3: recvSrc = (p + numProcs - 1) mod numProcs;
4: sendDst = (p + 1) mod numProcs;
5: // Circulate C'_p in a ring.
6: receivedMsgNum = 0;
7: while receivedMsgNum < (numProcs - 1) do
8:    sendBuf[] ← C'_p;
9:    Send sendBuf[] to sendDst;
10:   Receive C'_p from recvSrc with recvBuf[];
11:   Update maxNormedComs[] and maxTComs[] for
      each community in C_p with received C'_p;
12:   ++receivedMsgNum;
13: end while
14: // Use maxDComs[] to record the max item for each
    detected community shown in Equation (4).
15: Get maxDComs[] for each community in C'_p with its own
    set of ground truth communities C_p;
16: // Circulate C_p in a ring.
17: receivedMsgNum = 0;
18: while receivedMsgNum < (numProcs - 1) do
19:   sendBuf[] ← C_p;
20:   Send sendBuf[] to sendDst;
21:   Receive C_p from recvSrc with recvBuf[];
22:   Update maxDComs[] for each community in C'_p with
      received C_p;
23:   ++receivedMsgNum;
24: end while
25: Calculate rankFMeasure and rankNVD with
    maxNormedComs[], maxTComs[], and maxDComs[]
    based on Equations (3) and (4);
26: Get F-measure and NVD by summing rankFMeasure
    and rankNVD of all numProcs processors;
27: Return F-measure and NVD;


We then present the parallel algorithm to calculate the
cluster matching based metrics, F-measure and NVD. From
the definitions of F-measure and NVD shown in Equation (3)
and Equation (4), respectively, we can see that in order to
calculate F-measure and NVD, we need to determine for each
ground truth community the discovered community that has
the largest number of common nodes with it. In addition, to
calculate NVD, we further need to locate for each discovered
community the ground truth community that has the largest
number of common nodes with it. Hence, in the algorithm,
both C'_p and C_p are circulated to each processor in a “ring”.
This “shifting” of C'_p and C_p each occurs N - 1 times for N
processors. For its own C'_p and each received C'_p, the processor
calculates its local values of F-measure and NVD with its
own C_p. Moreover, for its own C_p and each received C_p, the
processor computes its local values of NVD with its own
C'_p. Finally, the values of F-measure and NVD are the sums
of the local values of all N processors. Algorithm 2 shows our
parallel MPI algorithm for computing F-measure and NVD.
It takes C_p and C'_p as parameters.


Algorithm 3 MPI_pair_counting_metric(C_p(map), C'_p(map))

1: Count rankA11, rankA10, rankA01, and rankA00 using
   its own C_p(map) and its own C'_p(map);
2: // Circulate C'_p(map) in a ring.
3: recvSrc = (p + numProcs - 1) mod numProcs;
4: sendDst = (p + 1) mod numProcs;
5: receivedMsgNum = 0;
6: while receivedMsgNum < (numProcs - 1) do
7:    sendBuf[] ← C'_p(map);
8:    Send sendBuf[] to sendDst;
9:    Receive C'_p(map) from recvSrc with recvBuf[];
10:   Count rankA01 and rankA00 using its own C_p(map)
      and received C'_p(map);
11:   ++receivedMsgNum;
12: end while
13: Calculate rankRI, rankARI, and rankJI with
    rankA11, rankA10, rankA01, and rankA00;
14: Get RI, ARI, and JI by summing the rankRI,
    rankARI, and rankJI of all numProcs processors;
15: Return RI, ARI, and JI;


We now demonstrate how to calculate the pair counting
based metrics, RI, ARI, and JI, in parallel with MPI. To
calculate RI, ARI, and JI, each node in the network needs
to be compared with all the other nodes so as to get a11, a10,
a01, and a00. Therefore, each processor reads its own ground truth
communities and saves them as a map with the key being the node
id and the value being the id of the ground truth community to
which this node belongs. We denote the map of nodes with
their communities from the ground truth community structure as
C_p(map). This processor also reads the community information
for the nodes in C_p(map) from the discovered community
structure and saves it as a map. We denote the map of nodes
with their communities from the discovered community structure
as C'_p(map). C_p(map) and C'_p(map) have the same subset
of nodes but with their community information from the ground
truth community structure and the detected community structure,
respectively. In our implementation, we use a hash indexed map
in order to quickly look up the community that a node belongs
to. Since each node needs to be compared with all the other nodes,
in the algorithm C'_p(map) is circulated to each processor
in a “ring”. This “shifting” of C'_p(map) also occurs N - 1
times for N processors. On each processor, each node in
C_p(map) first traverses the other nodes in its own C_p(map)
to count its local values of a11, a10, a01, and a00. Then, this
node traverses the nodes in the received C'_p(map) to count
only a01 and a00, because this node is in a different ground
truth community from the nodes in the received C'_p(map). With
a11, a10, a01, and a00, the processor can get the local values
of RI, ARI, and JI based on Equations (5), (6), and (7).
Finally, the values of RI, ARI, and JI are the sums of the local
values of all N processors. Algorithm 3 shows our parallel MPI
algorithm for computing RI, ARI, and JI. It takes C_p(map)
and C'_p(map) as parameters.

Finally, we illustrate how to calculate the community
quality metrics without ground truth community structure, such
as modularity and Modularity Density, in parallel with MPI.
From Section II-B we can observe that in order to calculate
the contribution of a community to these metrics, we only
need to obtain the number of edges inside it, the number of
edges on its boundary, the numbers of edges between
it and its neighboring communities, and the sizes of it and its
neighboring communities. There is no dependency between
processors. Hence, there is no need to transfer communities or
any other message between processors. In the algorithm,
each processor reads its own set of discovered communities
C'_p and its local network, and then calculates its own part of
these metrics. At last, the values of these metrics are the sums
of the local values of all N processors. We do not show the
outline of this algorithm here because of its simplicity.


Algorithm 4 Pthreads_pair_counting_metric(nodes)

1: // Thread id is denoted as threadid. The number of
   threads used is denoted as numThreads.
2: // The number of nodes in the network.
3: numNodes = nodes.size();
4: for i = 0 to numNodes do
5:    if i mod numThreads == threadid then
6:       iNode = nodes[i];
7:       // Traverse all the other nodes for iNode.
8:       for j = 0 to numNodes do
9:          jNode = nodes[j];
10:         if iNode ≠ jNode then
11:            Count a11, a10, a01, and a00 based on the
               community information of iNode and jNode
               from ground truth community structure and
               discovered community structure;
12:         end if
13:      end for
14:   end if
15: end for
16: Calculate RI, ARI, and JI with a11, a10, a01, and a00
    based on Equations (5), (6), and (7);
17: Return RI, ARI, and JI;


B. Parallel Algorithms Based on Pthreads 

The parallel Pthreads algorithms for all the metrics, except
the ones based on pair counting, assign subsets of ground
truth communities and discovered communities, and also the local
network, to each thread using the same approach adopted in
the parallel MPI algorithms introduced in Section III-A. The
difference between the parallel Pthreads algorithms and the
parallel MPI algorithms is that the ground truth communities,
the detected communities, and the network are globally
accessible in Pthreads, while they are locally stored in MPI.
The cores of the algorithms to calculate these metrics do not
change compared with the parallel MPI algorithms, so we do
not present their outlines here.

Fig. 1. The total running time, computation time, and message passing time of the three parallel MPI algorithms for computing the community quality metrics
with ground truth community structure on GANXIS.

(a) Speedup. (b) Efficiency.

Fig. 2. The speedup and efficiency of the three parallel MPI algorithms for
computing the metrics with ground truth community structure on GANXIS.

In the parallel Pthreads algorithm to calculate the pair
counting based metrics, the problem is partitioned by
node instead of by community. Then, each node
traverses all the other nodes in the network to count a11, a10,
a01, and a00. The values of RI, ARI, and JI can then be
calculated based on Equations (5), (6), and (7). The outline of
this parallel Pthreads algorithm is shown in Algorithm 4.
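The node-based partitioning of Algorithm 4 can be mimicked with Python threads, as in the sketch below. It is meant only to show the work split (node i goes to thread i mod numThreads) and the shared, read-only community maps; CPython threads will not reproduce the Pthreads speedup, and the per-thread counter lists and the final halving of ordered-pair counts are choices made for this example.

```python
import threading


def threaded_pair_counts(truth, found, num_threads=4):
    """Return (a11, a10, a01, a00) using a node-partitioned thread pool.

    truth, found: shared dicts mapping node -> community id in C and C'.
    """
    nodes = sorted(truth)
    # One private [a11, a10, a01, a00] list per thread avoids locking.
    partial = [[0, 0, 0, 0] for _ in range(num_threads)]

    def worker(tid):
        counts = partial[tid]
        for i in range(tid, len(nodes), num_threads):  # i mod num_threads == tid
            u = nodes[i]
            for v in nodes:
                if u == v:
                    continue
                same_t = truth[u] == truth[v]
                same_d = found[u] == found[v]
                counts[(0 if same_t else 2) + (0 if same_d else 1)] += 1

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every unordered pair was visited twice, once from each endpoint.
    return tuple(sum(col) // 2 for col in zip(*partial))
```

Counting each pair from both endpoints doubles all four counts equally, so RI, ARI, and JI computed from the undivided counts would come out the same; the halving is done here only so the counts match their definitions.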

IV. Evaluation and Analysis 

In this section, we first introduce the parallel architectures
on which we perform runs for our parallel algorithms. Then,
we introduce the measures that are used to evaluate the
performance of these algorithms. We also give an introduction
to the LFR benchmark network [22] for which we calculate the
metrics. Finally, we show the performance results of the
parallel MPI and Pthreads algorithms.

A. Parallel Computing Architectures 

We perform runs on both a distributed memory machine,
such as Blue Gene/Q, and a shared memory machine, like
GANXIS. We vary the number of processors used in GANXIS
from 1 to 32 and the number of computing nodes (16 cores
for each node) used in Blue Gene/Q from 1 to 256.

1) GANXIS: GANXIS is a hyper-threaded Linux system
operating on a Silicon Mechanics Rackform nServ A422.v3
machine (GANXIS.nest.rpi.edu). Processing power is
provided by 64 cores organized as four AMD Opteron(TM) 6272
(2.1 GHz, 16-core, G34, 16 MB L3 Cache) central processing
units operating over a shared 512 GB of Random Access
Memory (RAM) (32 x 16 GB DDR3-1600 ECC Registered
2R DIMMs) running at 1600 MT/s Max.

2) Blue Gene/Q: The Blue Gene/Q system that we used
is stationed at The Computational Center for Nanotechnology
Innovations facility at RPI, Troy, NY.

B. Performance Measures 

We calculate speedup using Equation (10):

Speedup = T_1/T_p (10)

where T_1 is the running time of the sequential program
and T_p is the running time of the parallel program when p
processors are adopted. We also compute efficiency according
to Equation (11):

Efficiency = Speedup/p (11)

where Speedup is the actual speedup calculated according to
Equation (10) and p is the number of processors.

Notice that for our experimental results on Blue Gene/Q,
p denotes the number of computing nodes adopted, T_1
is the running time of the parallel program when only 1 node
is adopted, and T_p is the running time of the parallel program
when p nodes are adopted.
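As a small worked example of Equations (10) and (11), the helper below computes both measures from two timings. The function name and the timing values used with it are hypothetical and are not measurements from this paper.

```python
def speedup_efficiency(t1, tp, p):
    """Return (speedup, efficiency) for baseline time t1 and time tp on p units.

    p is the number of processors, or the number of computing nodes when the
    baseline is a one-node run as on Blue Gene/Q.
    """
    speedup = t1 / tp
    return speedup, speedup / p
```

For instance, a made-up run that takes 120 s on one processor and 10 s on 16 processors gives a speedup of 12 and an efficiency of 0.75; an efficiency above 1 corresponds to the super-linear case.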

C. LFR Benchmark Network 

We run our parallel MPI and Pthreads programs to calculate
the metrics for LFR benchmark networks [22], which have
known ground truth community structure. The average node
degree of the LFR benchmark networks is set to 15 and the
maximum node degree is set to 50. The exponent γ for the
degree sequence is 2. The exponent β for the community size
distribution is 1. The mixing parameter μ is equal to 0.3.

The LFR benchmark network for testing the parallel programs
that calculate the metrics with ground truth community
structure has 100,000 nodes. The ground truth community
structure is given when generating the network. The discovered
community structure is obtained by using a community
detection algorithm called Speaker-listener Label Propagation
Algorithm (SLPA) with threshold parameter r = 0.5.
SLPA gets disjoint communities when r = 0.5.

We choose two sizes of the LFR network, ten million nodes
(10,000,000) and one hundred million nodes (100,000,000),
to test the parallel programs that compute the metrics
without ground truth communities. We calculate the values of
these metrics for the ground truth community structure instead
of for the discovered community structure since it takes too
long to get the detected communities with SLPA.

D. Experimental Results 

In this part, we report the performance results of
the parallel MPI and Pthreads algorithms for community
quality metrics both with and without ground truth community
structure. For the running time, we do not take into account
the I/O time, that is, the time the program takes to read the
ground truth communities, the discovered communities, or
the network. Moreover, since GANXIS has 64 processors,
every thread in our parallel Pthreads algorithms executes on
its dedicated processor. Therefore, threads do not compete
for central processing unit (CPU) processors. They execute
in parallel, and we can completely ignore thread scheduling
issues in our considerations. Because of this, we use the terms
‘thread’ and ‘processor’ interchangeably when describing the
results of the parallel Pthreads algorithms on GANXIS.

(a) Total running time. (b) Computation time. (c) Message passing time.

Fig. 3. The total running time, computation time, and message passing time of the three parallel MPI algorithms for computing the community quality metrics
with ground truth community structure on Blue Gene/Q.

(a) Speedup. (b) Efficiency.

Fig. 4. The speedup and efficiency of the three parallel MPI algorithms
for computing the community quality metrics with ground truth community
structure on Blue Gene/Q.

1) Performance of Parallel Algorithms for Metrics with
Ground Truth Community Structure: Figures 1(a), 1(b), and
1(c) respectively show the total running time, computation
time, and message passing time of the three parallel MPI
algorithms, Algorithm 1, Algorithm 2, and Algorithm 3, that
compute the community quality metrics with ground truth
communities on GANXIS. Figure 1(a) indicates that the total
running time decreases as the number of processors increases.
Figure 1(b) implies that the computation time decreases as
the number of processors grows. However, Figure 1(c) shows
that the message passing time follows a sawtooth pattern
as the number of processors grows. The sawtooth behavior
is the result of the GANXIS architecture, in which each quad of
cores uses one shared global memory module. Increasing the
number of processors from 1 to 2 introduces message
passing for the first time; adding two more processors helps
because the messages are placed in the same shared memory
module for all four cores. With 8 processors, some messages
start to move between two memory modules, which
decreases the advantage of having more processors, and so on.
In addition, Figures 2(a) and 2(b) present the corresponding
speedup and efficiency. From Figure 2(a) we can observe that
the speedup of all three algorithms grows as the number of
processors increases. Figure 2(b) indicates that the efficiency
of Algorithm 3 first increases from 1 to 16 processors and then
decreases from 16 to 32 processors, while the efficiency of the
other two algorithms always decreases. It is worth noting that
the speedup and efficiency of Algorithm 1 and Algorithm 2 are
almost the same as each other, but both are much smaller
than those of Algorithm 3, which calculates the pair counting
based metrics. The efficiency of Algorithm 3 is even larger than
1, achieving a super-linear speedup. The super-linear speedup
is the result of the larger aggregate cache available on many
processors. As the number of processors increases, the volume
of data processed on each processor decreases while the cache
per processor stays the same size. Thus, the number of cache
misses decreases on each processor, speeding up the execution
beyond linear speedup.

Figures 3(a), 3(b), and 3(c) respectively present the total
running time, computation time, and message passing time of
the three parallel MPI algorithms, Algorithm 1, Algorithm 2,
and Algorithm 3, for computing the metrics with ground truth
communities on Blue Gene/Q. Note that the x-axis is the
number of nodes and each node has 16 processors. Thus,
the number of processors is the number of nodes times 16.
Figure 3(a) demonstrates that the total running time decreases
as the number of nodes grows. Figure 3(b) implies that the
computation time decreases as the number of nodes grows.
However, there is no obvious trend in the message
passing time. Moreover, Figures 4(a) and 4(b) show the
corresponding speedup and efficiency. It can be observed from
Figure 4(a) that the speedup of the three algorithms grows
when the number of nodes increases, except for Algorithm 1
and Algorithm 2 when there are 256 nodes. The reason is
that the message passing time, instead of the computation
time, is the dominant part of the total running time in this case.
Figure 4(b) implies that the efficiency of all three algorithms
decreases as the number of nodes grows.

Figures |5(a)| |5(b)| and |5(c)| respectively show the total 
running time, speedup, and efficiency of the three parallel 
Pthreads algorithms for computing the metrics with ground 
truth community structure on GANXIS. Figure |5(a)| implies 
that the total running time decreases when the number of 
processors increases. Figure |5(b)| indicates that the speedup 
increases as the growth of the number of processors. However, 
we could learn from Figure |5(c)| that the efficiency first grows 
from 1 to 2 processors and then decreases from 2 to 32 
processors. Comparing the total running time in Figure |5(a)| 
and in Figure |l(a)| we could see that the total running time 
of the three parallel Pthreads algorithms is generally larger 
than that of the three parallel MPI algorithms. Therefore, 
we recommend using the parallel MPI algorithms instead of 
the parallel Pthreads algorithms to calculate the metrics with 
ground truth community structure on GANXIS (or shared 
memory machines). Also, it is interesting that the speedup and 
efficiency of the parallel Pthreads algorithms to calculate the 
information theory and cluster matching based metrics shown 

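(a) Total running time. (b) Speedup. (c) Efficiency. 

Fig. 5. The total running time, speedup, and efficiency of the parallel 
Pthreads algorithms for computing the community quality metrics with ground 
truth community structure on GANXIS. 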



(a) Total running time. (b) Speedup. (c) Efficiency. 

Fig. 6. The total running time, speedup, and efficiency of the parallel 
MPI algorithm for computing the community quality metrics without ground 
truth community structure on GANXIS (the LFR benchmark network has 
10,000,000 nodes). 



(a) Total running time. (b) Speedup. (c) Efficiency. 

Fig. 7. The total running time, speedup, and efficiency of the parallel 
Pthreads algorithm for computing the community quality metrics without 
ground truth community structure on GANXIS (the LFR benchmark network 
has 10,000,000 nodes). 

in Figures 5(b) and 5(c) are larger than those of the corresponding 
parallel MPI algorithms shown in Figures 2(a) and 2(b). 
However, the speedup and efficiency of the parallel Pthreads 
algorithm to calculate the pair counting based metrics shown 
in Figures 5(b) and 5(c) are much smaller than those of the 
corresponding parallel MPI algorithm shown in Figures 2(a) 
and 2(b). This leads to an interesting result: 
the speedup and efficiency of the parallel Pthreads algorithm 
to calculate the pair counting based metrics are smaller than 
those of the other two parallel Pthreads algorithms, while 
the speedup and efficiency of the parallel MPI algorithm for 
calculating the pair counting based metrics are much larger 
than those of the other two parallel MPI algorithms. 

2) Performance of Parallel Algorithms for Metrics without 
Ground Truth Community Structure: Figures 6(a), 6(b), and 
6(c) respectively show the total running time, speedup, and 
efficiency of the parallel MPI algorithm for calculating the 
metrics without ground truth community structure on GANXIS 
for an LFR benchmark network with 10,000,000 nodes. 
Figure 6(a) demonstrates that the total running time first 
decreases from 1 to 16 processors and then increases slightly 
from 16 to 32 processors. Figure 6(b) indicates that the speedup 
first grows from 1 to 16 processors and then decreases from 16 to 
32 processors. Thus, performance degrades at 32 processors. 
This performance penalty is again caused by the memory bank 
organization of the GANXIS machine. Also, Figure 6(c) shows that 
the efficiency decreases throughout. 

Figures 7(a), 7(b), and 7(c) present the corresponding 
total running time, speedup, and efficiency of the parallel 
Pthreads algorithm. We can observe that the total running 
time decreases, the speedup grows, and the efficiency decreases 
as the number of processors increases. Also, comparing 
Figure 6 with Figure 7 shows that the total running time of 
the parallel Pthreads algorithm is much smaller, and its 
speedup and efficiency are much higher, than those of the 
parallel MPI algorithm. 

Figures 8(a), 8(b), and 8(c) respectively show the total 
running time, speedup, and efficiency of the parallel MPI 
algorithm for calculating the metrics without ground truth 
community structure on GANXIS for an LFR benchmark 
network with 100,000,000 nodes. These three subfigures 
imply that the total running time decreases, the speedup 
increases, and the efficiency decreases as the number of 
processors grows. Similarly to the results shown in 
Figure 6, Figure 8 demonstrates that there is also a performance 
degradation when there are 32 processors. 

Figures 9(a), 9(b), and 9(c) present the corresponding total 
running time, speedup, and efficiency of the parallel Pthreads 
algorithm. They show that the total running time decreases, 
the speedup grows, and the efficiency decreases as the 
number of processors increases. Comparing the results in 
Figure 8 with those in Figure 9, we can observe that the total 
running time of the parallel Pthreads algorithm is much smaller, 
and its speedup and efficiency are much higher, than those of 
the parallel MPI algorithm. 

When the LFR benchmark network has 10,000,000 nodes, 
the smallest running time the parallel MPI algorithm 
achieves is 21.06 seconds at 16 processors, while the 
smallest running time the parallel Pthreads algorithm achieves 
is 7.02 seconds at 32 processors. The largest speedup the 
parallel MPI algorithm achieves is 8.87 at 16 processors, while the 
largest speedup the parallel Pthreads algorithm achieves is 20.9 at 
32 processors. When the LFR benchmark network has 
100,000,000 nodes, the smallest running time the parallel MPI 
algorithm achieves is 208.82 seconds at 32 processors, 
while the smallest running time the parallel Pthreads algorithm 
achieves is 65.66 seconds at 32 processors. The largest 
speedup the parallel MPI algorithm achieves is 10.98 at 32 
processors, while the largest speedup the parallel Pthreads 
algorithm achieves is 23.92 at 32 processors. Thus, we recommend 
using the parallel Pthreads algorithm instead of the 
parallel MPI algorithm to calculate the metrics without ground 
truth communities on GANXIS (or on other shared memory machines). 
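
The best-case figures quoted above can be put side by side with a few lines of arithmetic. The sketch below assumes the standard definition of efficiency as speedup divided by the number of processors. The dictionary layout and the function name are illustrative choices, while the numeric values are the ones reported in the text.

```python
# Best results reported in the text for the metrics without ground truth:
# (network size in nodes, implementation) -> (processors, time in s, speedup)
REPORTED = {
    (10_000_000, "MPI"): (16, 21.06, 8.87),
    (10_000_000, "Pthreads"): (32, 7.02, 20.9),
    (100_000_000, "MPI"): (32, 208.82, 10.98),
    (100_000_000, "Pthreads"): (32, 65.66, 23.92),
}


def efficiency_from_speedup(speedup: float, processors: int) -> float:
    """Efficiency E = S / p (standard definition, assumed here)."""
    return speedup / processors


# Efficiency at the processor count where each implementation did best.
for (size, impl), (p, time_s, s) in REPORTED.items():
    e = efficiency_from_speedup(s, p)
    print(f"{size:>11,d} nodes  {impl:8s}  p={p:2d}  "
          f"time={time_s:7.2f}s  speedup={s:5.2f}  efficiency={e:.3f}")

# How many times faster the best Pthreads run is than the best MPI run.
for size in (10_000_000, 100_000_000):
    ratio = REPORTED[(size, "MPI")][1] / REPORTED[(size, "Pthreads")][1]
    print(f"{size:>11,d} nodes  MPI time / Pthreads time = {ratio:.2f}")
```

Under this definition, the Pthreads efficiency at 32 processors is roughly 0.65 for the smaller network and 0.75 for the larger one, and the best Pthreads run is about three times faster than the best MPI run at both network sizes.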

(a) Total running time. (b) Speedup. (c) Efficiency. 

Fig. 8. The total running time, speedup, and efficiency of the parallel 
MPI algorithm for computing the community quality metrics without ground 
truth community structure on GANXIS (the LFR benchmark network has 
100,000,000 nodes). 



(a) Total running time. (b) Speedup. (c) Efficiency. 

Fig. 9. The total running time, speedup, and efficiency of the parallel 
Pthreads algorithm for computing the community quality metrics without 
ground truth community structure on GANXIS (the LFR benchmark network 
has 100,000,000 nodes). 

Another observation is that the speedup and the efficiency of 
both the parallel MPI algorithm and the parallel Pthreads 
algorithm grow as the size of the network increases. 

V. Conclusion 

In this paper, we provide a parallel toolkit, implemented 
with MPI and Pthreads, to calculate the community quality 
metrics with and without ground truth community structure. 
We evaluate its performance on both a distributed memory 
machine (Blue Gene/Q) and a shared memory machine 
(GANXIS). We conduct experiments on 
LFR benchmark networks with 100,000; 10,000,000; and 
100,000,000 nodes. The experimental results 
indicate that both the parallel MPI programs and the 
parallel Pthreads programs yield a significant performance 
gain over sequential execution. In addition, we find that 
the parallel MPI algorithms perform better than the parallel 
Pthreads algorithms in terms of total running time, speedup, 
and efficiency when calculating the metrics with ground truth 
community structure, while the situation reverses when 
computing the metrics without ground truth community structure. 
Therefore, we recommend using the parallel MPI algorithms 
to calculate the metrics with ground truth community structure 
and the parallel Pthreads algorithm to calculate the 
metrics without ground truth community structure. 